Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
This is a tutorial for implementing a simple machine learning pipeline aimed at machine learning beginners. In this notebook we will
It is assumed that you have some general knowledge on
In [1]:
# Python's scientific computing package and a random number generator
import numpy as np
import random
# machine learning classifiers and metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# standard data sets in python
from sklearn import datasets
# create random train and test splits
from sklearn.model_selection import train_test_split
# plotting tool
import matplotlib.pyplot as plt
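Ok, let's get the data then and have a look at some examples. We use the iris data set: each example is a flower described by four measurements (sepal length, sepal width, petal length and petal width, in cm), and the target is one of three iris species. Can you decide with high confidence which species you are looking at for each of the examples?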
In [2]:
# load data
data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, shuffle=True, test_size=0.2, random_state=42)
In [3]:
# show examples from the data set
# let's look at the first 5 examples (feature vectors and their targets)
print(X_train[:5])
print(y_train[:5])
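To make the printed numbers easier to read, it can help to look up what the columns and the target values stand for. The sketch below uses the `feature_names` and `target_names` attributes of the object returned by `load_iris()`; the variable `first_five_names` is only introduced here for illustration.

```python
from sklearn import datasets

data = datasets.load_iris()

# names of the four measured features (the columns of data.data)
print(data.feature_names)

# names of the three classes that are encoded as 0, 1, 2 in data.target
print(data.target_names)

# translate the first five targets of the full data set into species names
first_five_names = data.target_names[data.target[:5]]
print(first_five_names)
```

The same lookup works for the split data, e.g. `data.target_names[y_train[:5]]`.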
In [4]:
# investigate the size of the feature matrices
print(X_train.shape)
print(X_test.shape)
So, we have 120 data items for training and 30 for testing.
Now we are nearly ready to train our first classifier. One thing still needs to be said. Classification is a supervised machine learning task. This means we give the classifier a feature vector together with the desired output (the target). The targets were also loaded from the original data set and reside in the vector $y_{train}$. Putting things together, the classifier gets a matrix that contains one row for each example and as many columns as we have features. It also gets a vector of targets that has one entry per example. Thus the number of rows in $X_{train}$ is equal to the length of $y_{train}$. Isn't this neat?
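The relationship between the feature matrix and the target vector can be checked directly. This is a small sketch that repeats the split from above so that it runs on its own; the names `n_examples` and `n_features` are only used here.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, shuffle=True, test_size=0.2, random_state=42)

# rows are examples, columns are features
n_examples, n_features = X_train.shape
print(n_examples, n_features)

# the target vector has exactly one entry per row of the feature matrix
print(y_train.shape)
assert n_examples == len(y_train)
```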
Now we can finally train our first model (a model is a trained classifier). The scikit-learn library in Python uses a standard interface for all classifiers. This means that no matter which classifier you want to use, the functions you have to call are always named the same (but they might have different parameters).
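As an illustration of this shared interface, the following sketch trains two different classifiers with exactly the same calls to `fit()` and `predict()`. The dictionaries `classifiers` and `predictions` and the fixed `random_state=0` for the tree are choices made for this example only.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, shuffle=True, test_size=0.2, random_state=42)

# two different classifiers, both with default parameters
classifiers = {
    "naive Bayes": MultinomialNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                 # same training call for every classifier
    predictions[name] = clf.predict(X_test)   # same prediction call for every classifier
    print(name, predictions[name][:5])
```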
In [5]:
# initialize the model with standard parameters
clf_nb = MultinomialNB()
# train the model
clf_nb.fit(X_train,y_train)
Ok, nice. We have trained a model. In the code, the model is called clf_nb. But is it a good model? To answer this, we need to evaluate the model on data it has not yet seen, that is, on X_test and the respective labels y_test.
We do this in two steps:
We ask the classifier about its opinion by only giving it the test data (without the labels). This step is called prediction. We store the results in a vector y_test_pred_nb.
We count how often the classifier's predictions are the same as the correct labels. This step is called evaluation. The counting is already conveniently implemented in the library, so we only need to call the function accuracy_score(), which returns the ratio of correct predictions to total items. If you multiply this ratio by 100, you get a value that can be interpreted as "the classifier is ... percent correct on the test data".
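To see what the accuracy computation does, here is a minimal sketch that counts the matches by hand and compares the result with `accuracy_score()`. The label vectors `y_true` and `y_pred` are made-up toy values, not results from the iris data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical toy labels, only for illustration
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 1])

# count the positions where prediction and true label agree
n_correct = np.sum(y_true == y_pred)

# accuracy is the ratio of correct predictions to total items
manual_accuracy = n_correct / len(y_true)
print(manual_accuracy)                   # 0.8

# the library function computes the same ratio
print(accuracy_score(y_true, y_pred))    # 0.8
```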
Thus, we can conclude that the classifier has an accuracy of approximately 90%. In other words, it misclassifies about 10% of the examples. Is this good or bad? Has it learned something? What if we got a value of 50%? Would this be good?
Whether it has learned something can be answered quite easily. We can simply compare it to random guessing. There are 3 classes in the data set. The full data set contains 50 examples of each class, and because the split is random, the examples in the test set are also roughly uniformly distributed over the classes. You can easily check this by inspecting y_test. If the classifier randomly guessed which class it sees, it would have a chance of about 33% of getting it right. So, it has learned quite a lot.
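Both statements can be checked with a few lines of code. The sketch below counts the test examples per class and then simulates a guesser that picks one of the three classes uniformly at random. The number of trials, the seed and the variable names are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, shuffle=True, test_size=0.2, random_state=42)

# number of test examples per class (index = class label)
class_counts = np.bincount(y_test, minlength=3)
print(class_counts)

# simulate random guessing: each row is one complete round of guesses for the test set
rng = np.random.default_rng(0)
n_trials = 10000
guesses = rng.integers(0, 3, size=(n_trials, len(y_test)))

# fraction of correct guesses, averaged over all rounds (close to 1/3)
guess_accuracy = (guesses == y_test).mean()
print(guess_accuracy)
```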
In [6]:
# make predictions with the NB classifier
y_test_pred_nb = clf_nb.predict(X_test)
a_nb = accuracy_score(y_test, y_test_pred_nb)
print(a_nb)
In [7]:
# initialize a decision tree classifier with standard parameters
clf_dt = DecisionTreeClassifier()
# train the model
clf_dt.fit(X_train, y_train)
In [8]:
# make predictions with the decision tree classifier
y_test_pred_dt = clf_dt.predict(X_test)
a_dt = accuracy_score(y_test, y_test_pred_dt)
print(a_dt)
Can we find out more about the mistakes both models still make? If we could, we could probably find ways to improve them. It might also be the case that we find errors in the underlying data (e.g., mislabeled flowers or measurement errors). The latter is rather unlikely in this example, since this data set has been studied for a long time by many different researchers and practitioners.
One thing we could ask is which classes often get confused with one another. We can easily assess this, since we have the predictions and the true labels. For each pair of classes we just have to count how often label $i$ in the ground truth is predicted as label $j$. We display these counts in matrix form; this matrix is called the class confusion matrix $C$. Entry $(i,j)$ in this matrix holds the count of how often the target $i$ was predicted as $j$.
The count in each cell is indicated by the shade of the respective cell: the darker the cell, the more examples fall into it. Correct predictions end up on the diagonal, misclassified examples in the off-diagonal cells.
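The counting behind the confusion matrix can be sketched in a few lines. The labels `y_true` and `y_pred` below are made-up toy values, and the loop is only meant to illustrate the definition of $C$; the result is compared with the library function `confusion_matrix()`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical toy labels, only for illustration
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 2, 1, 2])

n_classes = 3
C = np.zeros((n_classes, n_classes), dtype=int)

# entry (i, j) counts how often true label i was predicted as j
for i, j in zip(y_true, y_pred):
    C[i, j] += 1

print(C)
# [[2 0 0]
#  [0 2 1]
#  [0 1 2]]

# the library function yields the same matrix
print(np.array_equal(C, confusion_matrix(y_true, y_pred)))   # True
```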
In [9]:
# get the confusion matrices for both classifiers
cm_nb = confusion_matrix(y_test, y_test_pred_nb)
cm_dt = confusion_matrix(y_test, y_test_pred_dt)
# plot the confusion matrices side by side
for idx, (title, cm) in enumerate([('Decision Tree', cm_dt), ('Naive Bayes', cm_nb)], start=1):
    plt.subplot(1, 2, idx)
    plt.title(title, fontsize=16)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.xticks(np.arange(3))
    plt.yticks(np.arange(3))
# adjust the spacing once, after both subplots are drawn
plt.tight_layout()
plt.show()
Both plots show a nice dark diagonal. This indicates that most of the examples are predicted correctly (as we know from the accuracy measures). Any off-diagonal cell that is not white marks a pair of classes that the respective model confuses.
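The plots can also be read off numerically. The following sketch, shown here for the Naive Bayes model only, computes for each class the fraction of its test examples that were predicted correctly and lists the confused pairs. The names `clf`, `y_pred`, `per_class_correct` and `off_diagonal` are introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, shuffle=True, test_size=0.2, random_state=42)

clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

# fraction of correctly predicted examples per true class (diagonal / row sum)
per_class_correct = cm.diagonal() / cm.sum(axis=1)
for name, fraction in zip(data.target_names, per_class_correct):
    print(f"{name}: {fraction:.2f}")

# list every off-diagonal cell with a non-zero count
off_diagonal = cm - np.diag(cm.diagonal())
for i, j in zip(*np.nonzero(off_diagonal)):
    print(f"{cm[i, j]} x {data.target_names[i]} predicted as {data.target_names[j]}")
```

The sum of the diagonal divided by the sum of all entries gives the overall accuracy again.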
We have seen a very basic machine learning pipeline. We trained two different classifiers on the same data set and compared them. There are, of course as always ;-) many more things one could try, some of them are:
That's all for today.